%
% File iscol2014.tex
%
%
%

\documentclass[11pt]{article}
\usepackage{coling2014}
\usepackage{times}
\usepackage{url}
\usepackage{latexsym}
\usepackage{graphics,qtree}
\usepackage{multirow}
% for underlining specific words
\usepackage[normalem]{ulem}
\usepackage{amsmath}


\title{Generating Subjective  Responses  to Opinionated Articles in Social Media: \\An Agenda-Driven  Architecture and a Turing-Like Test}

 \author{Tomer Cagan \\
  \normalsize School of Computer Science \\
  \normalsize The Interdisciplinary Center \\
  \normalsize Herzeliya, Israel  \\
  \normalsize {\tt cagan.tomer@idc.ac.il} \\\And
  Stefan L. Frank \\
  \normalsize Centre for Language Studies \\
  \normalsize Radboud University  \\
  \normalsize Nijmegen, The Netherlands \\
  \normalsize {\tt s.frank@let.ru.nl } \\\And
  Reut Tsarfaty \\
  \normalsize Mathematics and Computer Science \\
  \normalsize Weizmann Institute of Science \\
  \normalsize Rehovot, Israel  \\
  \normalsize {\tt tsarfaty@weizmann.ac.il}  \\}

\date{}

\begin{document}
\maketitle
Natural language traffic in social media % (blogs, microblogs, talkbacks)
enjoys vast monitoring and analysis efforts. However, the question of whether computer systems can generate such content has been only sparsely attended to. This paper presents an end-to-end architecture for generating subjective responses to online articles. We aim to generate responses that promote specific users' agendas, which may differ from the opinion reflected in the article. Our generation system integrates users' agendas, documents' topics, sentiment analysis and a knowledge graph, alongside a template-based surface realizer. We present a novel empirical evaluation method for quantifying the human-likeness and relevance of the generated responses in a crowd-sourced, Turing-test-like setting. We empirically show that including world knowledge in the input increases the generated responses' human-likeness, while not affecting perceived relevance.\footnote{This work is now published as \newcite{cagan14}.}

\subsection*{Motivation and Background}\label{sec:intro}
As many of our day-to-day activities move online \cite{Viswanath:2009:EUI:1592665.1592675}, the importance of social media interactions to businesses \cite{Qualman-2012-socialnomics,Haenlein01092009} and governments \cite{Howard2011,Lamer2012} vastly increases. Therefore, natural language traffic in social media (blogs, microblogs, talkbacks) now enjoys vast monitoring and analysis efforts in such organizations. The increased importance of online communication has also led to many research advances that pertain to the analysis of social media interactions: subjectivity and sentiment analysis \cite{Davidov:2010:ESL:1944566.1944594}, opinion mining \cite{Mishne06multipleranking}, affectiveness of online texts \cite{Danescu-Niculescu-Mizil:2009:ORO:1526709.1526729}, and many more. In contrast to research on such {\em analysis} efforts, the question of whether computer systems can {\em generate} online content to effectively interact with humans has been only sparsely attended to.\footnote{The only study we are aware of is that of \newcite{Ritter:2011:DRG:2145432.2145500}, which uses a machine translation engine to generate responses to tweets. This work differs from our own in that it does not generate novel subjective responses, but rather provides a one-size-fits-all mechanism.}


\subsection*{Research Question}\label{sec:question}
Can a computer generate a fluent, relevant, and human-like response that effectively engages readers and clearly serves the responder's communicative goal? The present paper addresses this problem of generating novel, subjective responses to opinionated online articles on behalf of an interested agent. We propose a formal model, an end-to-end implemented architecture, and an empirical evaluation method for a system that generates such responses. The generation process takes into account the user's agenda, the document's topics and sentiment, and, optionally, a knowledge graph. In addressing the question of how such responses may be faithfully evaluated, we develop a Turing-test-like method for quantifying the human-likeness and relevance of computer-generated responses in online settings.

\subsection*{The Solution}\label{sec:model}
In social media, natural language utterances often serve a communicative goal, such as promoting the user's disposition towards some topic. To capture this communicative goal, we define a {\em user agenda} as a set of topics associated with the user's sentiment. A triggering event for generating an utterance that serves such a goal may be a new online {\em document} which conveys an author's sentiment towards some topics. When the agenda and the document content overlap, our system triggers the generation of a response.

Formally, assuming that \(A\) is a set of agendas, \(D\) is a set of documents, and \(S\) is the set of valid English sentences, we wish to implement the function $f_{\rm response}:  {D}\times {A}\rightarrow {S}$, where the output \(s \in S\) is an English sentence expressing the responder's beliefs or sentiments towards the document's topics, relative to those of the author. To implement this function we define a composite process containing two phases, illustrated in the sketch below:
%
 (a) an analysis phase $p: {D}\rightarrow {C}$, and
 %
 (b) a generation phase  $g: {C}\times {A}\rightarrow  {S}$.
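
To make this decomposition concrete, the following Python sketch shows how the two phases compose into $f_{\rm response}$. The names (\texttt{ContentElement}, \texttt{analyze}, \texttt{generate}) are illustrative placeholders rather than our actual implementation:

\begin{verbatim}
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class ContentElement:
    topic: str       # topic identifier from the topic model
    sentiment: int   # sentiment value in [-n..n]

Agenda = List[ContentElement]  # the user's topics and dispositions
Document = str                 # raw text of the online article

def analyze(document: Document) -> List[ContentElement]:
    """Phase (a), p: D -> C (topic model + sentiment analysis)."""
    ...

def generate(content: List[ContentElement],
             agenda: Agenda) -> Optional[str]:
    """Phase (b), g: C x A -> S (template-based realization)."""
    ...

def f_response(document: Document,
               agenda: Agenda) -> Optional[str]:
    content = analyze(document)
    # A response is triggered only when the agenda and the
    # document's content elements overlap.
    shared = [c for c in content
              if any(a.topic == c.topic for a in agenda)]
    return generate(shared, agenda) if shared else None
\end{verbatim}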

The goal of \(p\) is to extract the document's topic(s) and related sentiments (henceforth, {\em content elements}) and yield a set of content elements, represented as \(c \in C\). The generation phase takes as input the content elements extracted from the document, as well as the content elements defined in the user's agenda, and generates a response based on their intersection. The implementation of \(p\) relies on a trained topic model \cite{Papadimitriou:1998:LSI:275487.275505,Hofmann:1999:PLS:312624.312649,Blei:2003:LDA:944919.944937} with an associated sentiment value, \(\mathit{sentiment}_t \in [-n..n]\), defined for each document or user agenda. The implementation of \(g\) employs a template-based generation approach, as in \newcite{Reiter:1997:BAN:974487.974490} and \newcite{VanDeemter:2005}.
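
As a minimal sketch of how \(p\) could be realized, the following assumes a pre-trained gensim LDA model and a simple word-level sentiment lexicon (neither is necessarily our exact implementation), reusing \texttt{ContentElement} from the sketch above:

\begin{verbatim}
def analyze_with_lda(document, lda, dictionary,
                     sentiment_lexicon, n=3, threshold=0.2):
    """Infer topics with a trained LDA model and attach a
    lexicon-based sentiment score clipped to [-n..n]."""
    tokens = document.lower().split()
    bow = dictionary.doc2bow(tokens)
    # Keep topics whose posterior probability is high enough.
    topics = [t for t, prob in lda.get_document_topics(bow)
              if prob >= threshold]
    # Crude sentiment: sum of per-word lexicon scores, clipped.
    score = sum(sentiment_lexicon.get(w, 0) for w in tokens)
    sentiment = max(-n, min(n, score))
    return [ContentElement(topic=str(t), sentiment=sentiment)
            for t in topics]
\end{verbatim}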

The design of our templates reflects three of the Gricean maxims of communication \cite{grice}: \emph{quantity} (responses are brief and concise), \emph{relation} (responses directly address the document's content) and \emph{quality} (responses express the responder's beliefs, sentiments, or dispositions towards the topic(s)). We enforce these maxims through the templates' design: the responses' length and density are controlled by the number of substitution slots in the templates (quantity); the templates directly incorporate user and document topics and sentiments (relation); and users' opinions, perceived as their respective truths, define the response's relation to the document content (quality).
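
For illustration, a hypothetical realization of \(g\) with two hand-written templates follows; the wording of the templates is made up here and does not reproduce our actual template inventory:

\begin{verbatim}
import random

# Hypothetical templates; slots in braces are substitution slots.
TEMPLATES = {
    "agree":    ["I completely agree, {topic} really is {polarity}.",
                 "Well said! {topic} is indeed {polarity}."],
    "disagree": ["I disagree, to me {topic} seems {polarity}.",
                 "Not at all, {topic} is actually {polarity}."],
}

def generate(shared, agenda):
    """Pick a stance by comparing the responder's sentiment with
    the document's, then fill a matching template."""
    element = shared[0]
    user_sent = next(a.sentiment for a in agenda
                     if a.topic == element.topic)
    stance = ("agree" if user_sent * element.sentiment > 0
              else "disagree")
    polarity = "great" if user_sent > 0 else "disappointing"
    return random.choice(TEMPLATES[stance]).format(
        topic=element.topic, polarity=polarity)
\end{verbatim}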

\subsection*{Empirical Evaluation}\label{sec:eval}
Based on our architecture, we implemented and tested two systems: a baseline system as defined above, and a variant that also includes a knowledge base, used to expand the response with a sentence on topics related to those of the document. Due to the large space of possible outputs, there is no gold standard or ground truth to compare our generated responses against, so we resort to human evaluation akin to the well-known {\em Turing test} \cite{Turing1950}. In our evaluation, we ask human participants to assess our system's output as well as real human responses to the same article snippets. We conducted two online surveys (using Amazon Mechanical Turk, \url{www.mturk.com}) in which we asked participants to rate the human-likeness and relevance of the human and computer responses. In all cases, we considered online articles about mobile devices, and simulated responses for a range of possible user agendas. Some responses were generated with the addition of the knowledge base, and others without.

\subsection*{Results} \label{sec:results}
Our generated responses receive a higher computer-likeness rating than human responses do (4.32 for the system versus 3.33 for human responses), indicating that our ultimate goal is yet to be reached. In terms of relevance, our responses scored 4.52 while human responses scored 4.85, indicating that our generated responses are roughly at the same level as human ones. We additionally investigated, using regression analysis, which factors make responses more human-like. Our results show that responses generated using world knowledge are regarded as more human-like than those relying on topic, sentiment and agenda alone, whereas the use of world knowledge does not affect perceived relevance. We also identified a learning effect: participants became better at identifying the computer-generated responses over time, which we attribute to repetitiveness in the use of our templates.
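
For concreteness, a regression of this kind can be run as follows. The sketch uses statsmodels on synthetic stand-in data with assumed predictor names (\texttt{knowledge}, \texttt{trial}); the real analysis used the per-response ratings from the surveys, and this is not our exact specification:

\begin{verbatim}
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "knowledge": rng.integers(0, 2, n),      # 1 if world knowledge used
    "trial": np.tile(np.arange(1, 21), 10),  # item position in survey
})
# Simulated ratings: a knowledge bonus plus a learning trend.
df["humanlike"] = (3.5 + 0.4 * df["knowledge"]
                   - 0.03 * df["trial"]
                   + rng.normal(0, 1, n)).clip(1, 7)

model = smf.ols("humanlike ~ knowledge + trial", data=df).fit()
print(model.summary())  # coefficients for knowledge and trial
\end{verbatim}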

\subsection*{Conclusion and Future Work}\label{sec:conclusion}
Our evaluation exposed several strengths and weaknesses of the models, which we aim to further investigate and improve on in future work. First, we aim to use our empirical evaluation method to study online responses more comprehensively, towards identifying common linguistic characteristics. We then plan to use these linguistic characteristics to devise a more general grammar-based generation engine that replaces our templates, combating the learning effect by adding more variance. On a different note, we plan to explore the use of a large-scale knowledge base, such as Freebase \cite{Bollacker08}, in order to expand our output domain and make responses more human-like, more diverse, and ultimately also more interesting.



\bibliography{talkbackref}
\bibliographystyle{acl}


\end{document}
